Challenge 1 Solution

Reading in data and creating a post
challenge_1
solution
Author

Sean Conway

Published

December 23, 2023

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(readxl)
library(here)
here() starts at /Users/seanconway/Github/DACSS_601_W24
Code
knitr::opts_chunk$set(echo = TRUE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables, etc)

Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

You should have already downloaded the datasets from Google Classroom and stored them in a common directory on your computer.

In this challenge, as in all subsequent challenges, the number of stars corresponds to the difficulty of the dataset. You are only required to do the challenge on one dataset, though you are welcome to do it with multiple datasets.

In general, I encourage you to “challenge” yourself by trying to work with a dataset above your experience.

  • railroad_2012_clean_county.csv
  • birds.csv ⭐⭐
  • FAOSTAT*.csv ⭐⭐
  • wild_bird_data.xlsx ⭐⭐⭐
  • StateCounty2012.xls ⭐⭐⭐⭐

Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.

It is hard to get much information about the data source or contents from a .csv file - as compared to the formatted .xlsx version of the same data described below.

Read the Data

Code
railroad <- here("posts","_data","railroad_2012_clean_county.csv") %>%
  read_csv()
Rows: 2930 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, county
dbl (1): total_employees

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
railroad

From inspection, we can see that the three variables are named state, county, and total employees. Combined with the name of the file, this appears to be aggregated data on the number of railroad employees in each county in 2012. We assume that the 2930 cases - which are counties nested within states1 - consist only of counties where there are railroad employees.

Code
railroad %>%
  select(state) %>%
  n_distinct(.)
[1] 53
Code
railroad%>%
  select(state)%>%
  distinct()

With a few simple commands, we can confirm that there are 53 “states” represented in the data. To identify the additional non-state areas (probably District of Columbia, plus some combination of Puerto Rico and/or overseas addresses), we can print out a list of unique state names.
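To pin down which non-state areas are present, here is a quick sketch - using a toy stand-in for the data, since we would otherwise just re-print the real tibble. state.abb is base R's built-in vector of the 50 two-letter state abbreviations, and I'm assuming the state column uses those codes:

```r
library(dplyr)

# toy stand-in for the railroad data
toy <- tibble(state = c("AL", "AK", "DC", "AE", "AP"))

# keep only the rows whose state code is not one of the 50 states
toy %>%
  filter(!state %in% state.abb)
```

Run against the real data, this should surface the District of Columbia along with any territory or overseas (armed forces) codes directly.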


1: We can identify the case-identifying variables because both are character variables, which in tidy lingo are grouping variables rather than values.

Once again, a .csv file lacks any of the additional information that might be present in a published Excel table. So, we know the data are likely to be about birds, but will we be looking at individual pet birds, prices of bird breeds sold in stores, the average flock size of wild birds - who knows!

The FAOSTAT*.csv files have some additional information - the FAO - which a Google search reveals to be the Food and Agriculture Organization of the United Nations. It regularly publishes country-level data in a database called FAOSTAT. So my best guess at this point is that we are going to be looking at country-level estimates of the number of birds that are raised for eggs and poultry, but we will see if this is right by inspecting the data.

We’re also lumping in the birds.csv dataset here, because it comes from the same source.

Read the Data

Code
birds <- here("posts","_data","birds.csv") %>%
  read_csv()
Rows: 30977 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
chickens <- here("posts","_data","FAOSTAT_egg_chicken.csv") %>%
  read_csv()
Rows: 38170 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
cattle <- here("posts","_data","FAOSTAT_cattle_dairy.csv") %>%
  read_csv()
Rows: 36449 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
country <- here("posts","_data","FAOSTAT_country_groups.csv") %>%
  read_csv()
Rows: 1943 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Country Group, Country, M49 Code, ISO2 Code, ISO3 Code
dbl (2): Country Group Code, Country Code

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
livestock <- here("posts","_data","FAOSTAT_livestock.csv") %>%
  read_csv()
Rows: 82116 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
birds
Code
chickens
Code
cattle
Code
country
Code
livestock
Code
glimpse(birds)
Rows: 30,977
Columns: 14
$ `Domain Code`      <chr> "QA", "QA", "QA", "QA", "QA", "QA", "QA", "QA", "QA…
$ Domain             <chr> "Live Animals", "Live Animals", "Live Animals", "Li…
$ `Area Code`        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ Area               <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgha…
$ `Element Code`     <dbl> 5112, 5112, 5112, 5112, 5112, 5112, 5112, 5112, 511…
$ Element            <chr> "Stocks", "Stocks", "Stocks", "Stocks", "Stocks", "…
$ `Item Code`        <dbl> 1057, 1057, 1057, 1057, 1057, 1057, 1057, 1057, 105…
$ Item               <chr> "Chickens", "Chickens", "Chickens", "Chickens", "Ch…
$ `Year Code`        <dbl> 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
$ Year               <dbl> 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
$ Unit               <chr> "1000 Head", "1000 Head", "1000 Head", "1000 Head",…
$ Value              <dbl> 4700, 4900, 5000, 5300, 5500, 5800, 6600, 6290, 630…
$ Flag               <chr> "F", "F", "F", "F", "F", "F", "F", NA, "F", "F", "F…
$ `Flag Description` <chr> "FAO estimate", "FAO estimate", "FAO estimate", "FA…
Code
glimpse(chickens)
Rows: 38,170
Columns: 14
$ `Domain Code`      <chr> "QL", "QL", "QL", "QL", "QL", "QL", "QL", "QL", "QL…
$ Domain             <chr> "Livestock Primary", "Livestock Primary", "Livestoc…
$ `Area Code`        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ Area               <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgha…
$ `Element Code`     <dbl> 5313, 5410, 5510, 5313, 5410, 5510, 5313, 5410, 551…
$ Element            <chr> "Laying", "Yield", "Production", "Laying", "Yield",…
$ `Item Code`        <dbl> 1062, 1062, 1062, 1062, 1062, 1062, 1062, 1062, 106…
$ Item               <chr> "Eggs, hen, in shell", "Eggs, hen, in shell", "Eggs…
$ `Year Code`        <dbl> 1961, 1961, 1961, 1962, 1962, 1962, 1963, 1963, 196…
$ Year               <dbl> 1961, 1961, 1961, 1962, 1962, 1962, 1963, 1963, 196…
$ Unit               <chr> "1000 Head", "100mg/An", "tonnes", "1000 Head", "10…
$ Value              <dbl> 4000, 25000, 10000, 4400, 25000, 11000, 4600, 25000…
$ Flag               <chr> "F", "Fc", "F", "F", "Fc", "F", "F", "Fc", "F", "F"…
$ `Flag Description` <chr> "FAO estimate", "Calculated data", "FAO estimate", …
Code
glimpse(cattle)
Rows: 36,449
Columns: 14
$ `Domain Code`      <chr> "QL", "QL", "QL", "QL", "QL", "QL", "QL", "QL", "QL…
$ Domain             <chr> "Livestock Primary", "Livestock Primary", "Livestoc…
$ `Area Code`        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ Area               <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgha…
$ `Element Code`     <dbl> 5318, 5420, 5510, 5318, 5420, 5510, 5318, 5420, 551…
$ Element            <chr> "Milk Animals", "Yield", "Production", "Milk Animal…
$ `Item Code`        <dbl> 882, 882, 882, 882, 882, 882, 882, 882, 882, 882, 8…
$ Item               <chr> "Milk, whole fresh cow", "Milk, whole fresh cow", "…
$ `Year Code`        <dbl> 1961, 1961, 1961, 1962, 1962, 1962, 1963, 1963, 196…
$ Year               <dbl> 1961, 1961, 1961, 1962, 1962, 1962, 1963, 1963, 196…
$ Unit               <chr> "Head", "hg/An", "tonnes", "Head", "hg/An", "tonnes…
$ Value              <dbl> 700000, 5000, 350000, 700000, 5000, 350000, 780000,…
$ Flag               <chr> "F", "Fc", "F", "F", "Fc", "F", "F", "Fc", "F", "F"…
$ `Flag Description` <chr> "FAO estimate", "Calculated data", "FAO estimate", …
Code
glimpse(country)
Rows: 1,943
Columns: 7
$ `Country Group Code` <dbl> 5100, 5100, 5100, 5100, 5100, 5100, 5100, 5100, 5…
$ `Country Group`      <chr> "Africa", "Africa", "Africa", "Africa", "Africa",…
$ `Country Code`       <dbl> 4, 7, 53, 20, 233, 29, 35, 32, 37, 39, 24, 45, 46…
$ Country              <chr> "Algeria", "Angola", "Benin", "Botswana", "Burkin…
$ `M49 Code`           <chr> "012", "024", "204", "072", "854", "108", "132", …
$ `ISO2 Code`          <chr> "DZ", "AO", "BJ", "BW", "BF", "BI", "CV", "CM", "…
$ `ISO3 Code`          <chr> "DZA", "AGO", "BEN", "BWA", "BFA", "BDI", "CPV", …
Code
glimpse(livestock)
Rows: 82,116
Columns: 14
$ `Domain Code`      <chr> "QA", "QA", "QA", "QA", "QA", "QA", "QA", "QA", "QA…
$ Domain             <chr> "Live Animals", "Live Animals", "Live Animals", "Li…
$ `Area Code`        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ Area               <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgha…
$ `Element Code`     <dbl> 5111, 5111, 5111, 5111, 5111, 5111, 5111, 5111, 511…
$ Element            <chr> "Stocks", "Stocks", "Stocks", "Stocks", "Stocks", "…
$ `Item Code`        <dbl> 1107, 1107, 1107, 1107, 1107, 1107, 1107, 1107, 110…
$ Item               <chr> "Asses", "Asses", "Asses", "Asses", "Asses", "Asses…
$ `Year Code`        <dbl> 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
$ Year               <dbl> 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
$ Unit               <chr> "Head", "Head", "Head", "Head", "Head", "Head", "He…
$ Value              <dbl> 1300000, 851850, 1001112, 1150000, 1300000, 1200000…
$ Flag               <chr> NA, NA, NA, "F", NA, NA, NA, NA, NA, NA, NA, NA, "F…
$ `Flag Description` <chr> "Official data", "Official data", "Official data", …

There’s clearly a lot going on with these data, but using the glimpse() function, as well as scrolling through the data, allows us to get more of a handle on the datasets. The columns containing Code appear to be redundant with another column, so we can likely ignore them for now.
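If we did want to set the redundant code columns aside, one option is select() with the contains() helper. A sketch with a toy tibble that mimics the FAOSTAT column layout, rather than the full birds data:

```r
library(dplyr)

# toy tibble mimicking the FAOSTAT column layout
toy <- tibble(`Area Code` = 2, Area = "Afghanistan",
              `Item Code` = 1057, Item = "Chickens", Value = 4700)

# drop every column whose name contains "Code", keeping the readable labels
toy %>%
  select(!contains("Code"))
```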

For now, we can focus on birds, cattle, chickens, and livestock.

In all 4 of these datasets, the Area column indicates the location of the agricultural product. Element indicates the type of product, Item indicates the animal. Year indicates the year of estimate. Unit indicates the unit of measurement.

These data are “long” or “tidy”. Don’t worry if you don’t know what this means yet - we haven’t gotten there in the course. Each row appears to comprise a case - a single measurement. A single row contains a measurement of an agricultural product from a single year and a single country.

The country dataset is less interesting. It only contains the codes to match Country and Country groups. We may need to go to the FAOSTAT website to figure this out more. For now, we’ll move on to the next dataset.
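If we ever did need the country groupings, a join on the numeric codes would likely do it. A sketch with toy rows, under the assumption that FAOSTAT's Area Code column matches the Country Code column in the country file:

```r
library(dplyr)

toy_livestock <- tibble(`Area Code` = 2, Area = "Afghanistan", Value = 1300000)
toy_groups    <- tibble(`Country Group` = "Southern Asia",
                        `Country Code` = 2, Country = "Afghanistan")

# attach the group label to each livestock row; note that a country can
# belong to several groups, so rows may be duplicated after the join
toy_livestock %>%
  left_join(toy_groups, by = c("Area Code" = "Country Code"))
```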

The “wild_bird_data” sheet is in Excel format (.xlsx) rather than the .csv format of the earlier data sets. In theory, it should be no harder to read in an Excel worksheet (or even a workbook) than a .csv file - the readxl package, part of the tidyverse, easily reads in Excel files.

However, in practice, most people use Excel sheets as a publication format - not a way to store data, so there is almost always a ton of “junk” in the file that is NOT part of the data table that we want to read in. Sometimes the additional “junk” is incredibly useful - it might include table notes or information about data sources. However, we still need a systematic way to identify this junk and get rid of it during the data reading step.

For example, let’s see what happens if we just read the wild bird data straight in from Excel.

Code
wildbirds <- here("posts","_data","wild_bird_data.xlsx") %>%
  read_excel()
wildbirds

Hm, this doesn’t seem quite right. It is clear that the first “case” contains information that looks more like variable labels. Let’s take a quick look at the raw data.

Wild Bird Excel File

Sure enough, the first row of the Excel file does contain additional information: a pointer to the article the data were drawn from. A quick Google reveals the article to be Nee, S., Read, A., Greenwood, J. et al. The relationship between abundance and body size in British birds. Nature 351, 312–313 (1991) (https://www.nature.com/articles/351312a0).

Skipping a row

We could try to manually adjust things - remove the first row, change the column names, and then change the column types. But this is a lot of work, and not really a best practice for data management. Let’s instead re-read the data using the skip argument of read_excel, and see if it fixes all of our problems!

Code
wildbirds <- here("posts","_data","wild_bird_data.xlsx") %>%
  read_excel(skip = 1)
wildbirds 

This now looks great! Both variables are numeric, and now correctly show up as doubles (<dbl>). The variable names might be a bit tough to work with, though, so it can be easier to assign new column names on read-in - and then manually adjust axis labels, etc. once you are working on your publication-quality graphs.

Note that I skip two rows this time, and apply my own column names.

Code
wildbirds <- here("posts","_data","wild_bird_data.xlsx") %>% 
  read_excel(skip = 2,col_names = c("weight", "pop_size"))
wildbirds

The data are pretty straightforward to interpret.

Code
glimpse(wildbirds)
Rows: 146
Columns: 2
$ weight   <dbl> 5.458872, 7.764568, 8.638587, 10.689735, 7.417226, 9.116935, …
$ pop_size <dbl> 532194.3951, 3165107.4454, 2592996.8678, 3524193.2266, 389806…

Each row is a single case, with measurements of weight and population size for (presumably) a single species of bird.

We may need to take a look at the publication if we want to figure out the species’ name. This is above and beyond this challenge, so we will move on.

Railroad (xls) ⭐⭐⭐⭐

The railroad data set is our most challenging data to read in this week, but is (compared to many published spreadsheets) a fairly straightforwardly formatted table published by the Railroad Retirement Board. The value variable is a count of the number of employees in each county and state combination.

Railroad Employment

Looking at the Excel file, we can see that there are only a few issues:

  1. There are three rows at the top of the sheet that are not needed.

  2. There are blank columns that are not needed.

  3. There are “Total” rows for each state that are not needed.

Skipping title rows

For the first issue, we use the “skip” option on read_excel from the readxl package to skip the rows at the top.

Code
here("posts","_data","StateCounty2012.xls") %>%
  read_excel(skip=3)
New names:
• `` -> `...2`
• `` -> `...4`

Removing empty columns

For the second issue, I name the blank columns “delete” to make it easy to remove the unwanted columns. I then use select (with ! to designate the complement, or NOT) to keep only the columns we want in the dataset - the rest are removed. Note that I skip 4 rows this time, as I do not need the original header row.

There are other approaches you could use for this task (e.g., remove all columns that have no valid values), but hard-coding variable names and types during data read-in is not a violation of best practices and - if used strategically - can often make later data cleaning much easier.
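For reference, the “drop columns with no valid values” approach might look like this sketch (toy data again), using select() with the where() predicate helper:

```r
library(dplyr)

# toy tibble with an entirely-empty column, like the blank Excel columns
toy <- tibble(State = c("AL", "AK"), blank = c(NA, NA), Employees = c(10, 20))

# keep only the columns that have at least one non-missing value
toy %>%
  select(where(~ !all(is.na(.x))))
```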

Code
here("posts","_data","StateCounty2012.xls")  %>%
  read_excel(skip = 4,
                     col_names= c("State", "delete", "County", "delete", "Employees"))%>%
  select(-contains("delete"))
New names:
• `delete` -> `delete...2`
• `delete` -> `delete...4`

Filtering “total” rows

For the third issue, we are going to use filter to identify (and drop) the rows that have the word “Total” in the State column. str_detect can be used to find the rows within a column that contain the designated “pattern”, while ! designates the complement of the selected rows (i.e., those without the “pattern” we are searching for).

The str_detect command is from the stringr package, and is a powerful and easy-to-use implementation of grep and regex in the tidyverse - the base R functions (grep, gsub, etc.) are classics but far more difficult to use, particularly for those out of practice. Be sure to explore the stringr package on your own.
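For instance, str_detect() returns a logical vector, which we can negate inside filter() (or use directly for subsetting):

```r
library(stringr)

states <- c("AL", "AL Total", "Alabama Total", "AK")
str_detect(states, "Total")
#> [1] FALSE  TRUE  TRUE FALSE

# the negation keeps only the non-Total entries
states[!str_detect(states, "Total")]
```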

Code
railroad <- here("posts","_data","StateCounty2012.xls") %>%
  read_excel(skip = 4,col_names= c("State", "delete", "County", "delete", "Employees"))%>%
  select(!contains("delete"))%>%
  filter(!str_detect(State, "Total"))
New names:
• `delete` -> `delete...2`
• `delete` -> `delete...4`
Code
railroad

Remove any table notes

Tables often have notes in the last few rows. You can check the table limits and use this information during data read-in to avoid reading the notes, by setting the n_max option to the total number of rows to read, or, less commonly, the range option to specify the spreadsheet range in standard Excel naming (e.g., “B4:R142”). If you didn’t handle this on read-in, you can use the tail command to check for notes, and then either tail or head to keep only the rows that you need.
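As a small illustration of the tail/head approach (on toy data, not the railroad file) - note that a negative n in head() drops rows from the end:

```r
library(tibble)

toy <- tibble(row = 1:5, value = c("a", "b", "c", "note1", "note2"))

tail(toy, 2)   # inspect the last two rows for stray table notes
head(toy, -2)  # negative n: keep everything except the last two rows
```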

Code
tail(railroad, 10)
Code
#remove the last two observations
railroad <- head(railroad, -2)
tail(railroad, 10)

The range approach

We can manually specify the range of cells we want to read in using the range argument. To do so, you’ll need to open the file up in Excel (or a similar program) and figure this out on your own.

Code
railroad_new <- here("posts","_data","StateCounty2012.xls") %>%
  read_excel(range = "B4:F2990", col_names= c("State", "delete", "County", "delete", "Employees")) %>%
  select(!contains("delete"))%>%
  filter(!str_detect(State, "Total"))
New names:
• `delete` -> `delete...2`
• `delete` -> `delete...4`
Code
railroad_new
Code
tail(railroad_new,10)

Confirm cases

And that is all it takes! The data are now ready for analysis. Let’s see if we get the same number of unique states that were in the cleaned .csv data we read in earlier.

Code
railroad%>%
  select(State)%>%
  n_distinct(.)
[1] 54
Code
railroad%>%
  select(State)%>%
  distinct()

Oh my goodness! It seems that we have an additional “State” - it looks like Canada is in the full excel data and not the tidy data. This is one example of why it is good practice to always work from the original data source!